Data-driven automated control algorithm for floating-zone crystal growth derived by reinforcement learning

Complete automation of materials manufacturing with high productivity is a key problem in some areas of materials processing. In floating-zone (FZ) crystal growth, which is a manufacturing process for semiconductor wafers such as silicon, an operator adaptively controls the input parameters in accordance with the state of the crystal growth process. Since the operation dynamics of FZ crystal growth are complicated, automation is often difficult, and the process is usually controlled manually. Here we demonstrate automated control of FZ crystal growth by reinforcement learning using the dynamics predicted by Gaussian mixture modeling (GMM) from a small number of trajectories. Our proposed method of constructing the control model is completely data-driven. Using an emulator program for FZ crystal growth, we show that the control model constructed by our proposed method can follow the ideal growth trajectory more accurately than the demonstration trajectories created by human operation. Furthermore, we reveal that policy optimization near the demonstration trajectories realizes accurate control following the ideal trajectory.

www.nature.com/scientificreports/

Recently, we proposed adaptation of the Gaussian mixture model (GMM) to predict the dynamics of FZ crystal growth, and demonstrated that GMM can precisely predict the operation trajectories from only five trajectories used for training [34]. In the present study, we constructed a control model by reinforcement learning using proximal policy optimization (PPO) and the dynamics predicted by GMM.

Reinforcement learning by PPO with GMM dynamics.
For control of FZ crystal growth with a small number of demonstration trajectories, we applied reinforcement learning by PPO with the dynamics predicted by GMM. Here we describe how to construct a control model for FZ crystal growth by combining GMM and PPO, based on the literature [35]. The state of the floating-zone melt at time $t+1$ is assumed to be composed of the height ($h$) and the diameter of the grown crystal ($d$), and is described as $s_{t+1} = (h_{t+1}, d_{t+1})$. It is determined by the state of the melt at time $t$, $s_t$, and by the input parameters, which include, for example, the power ($P$) and the movement speed of the feed ($v$), described as $a_t = (P_t, v_t)$:

$$s_{t+1} = f(s_t, a_t), \tag{1}$$

where $f$ stands for the true dynamics of FZ crystal growth. Once the GMM is constructed from the demonstration trajectories, the state of the melt at time $t+1$ can be predicted from the state of the melt and the input parameters at time $t$:

$$\hat{s}_{t+1} = f_{\mathrm{GMM}}(s_t, a_t). \tag{2}$$

The circumflex (^) indicates that the value is predicted, and $f_{\mathrm{GMM}}$ stands for the dynamics model trained by GMM. The details of the training of GMM are described in Ref. 34.

In PPO, the parameterized policy function $\pi_{\theta_p}(a_t \mid s_t)$ with parameter vector $\theta_p$, which generates the input values $a_t$ from the current state $s_t$ as a probability distribution, is iteratively optimized using a clipped surrogate objective $L^{\mathrm{CLIP}}(\theta_p)$ instead of a policy gradient [35][36][37]:

$$L^{\mathrm{CLIP}}(\theta_p) = \hat{\mathbb{E}}_{s_t, a_t}\left[\min\left(r(s_t, a_t, \theta_p)\,\hat{A}(s_t, a_t),\ \mathrm{clip}\left(r(s_t, a_t, \theta_p),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}(s_t, a_t)\right)\right], \tag{3}$$

$$r(s_t, a_t, \theta_p) = \frac{\pi_{\theta_p}(a_t \mid s_t)}{\pi_{\theta_p^{\mathrm{old}}}(a_t \mid s_t)}, \tag{4}$$

where $\varepsilon$ is a hyper-parameter determining the clipped region and $\pi_{\theta_p^{\mathrm{old}}}$ is the policy before the update. $A(s_t, a_t)$ is the advantage function, described as follows:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t), \tag{5}$$

where $Q(s_t, a_t)$ is the state-action value function and $V(s_t)$ is the state-value function. Here we approximately represent $Q(s_t, a_t)$ as follows:

$$Q(s_t, a_t) \simeq R_t(s_t, a_t) + \gamma V(s_{t+1}), \tag{6}$$

where $R_t(s_t, a_t)$ and $\gamma$ are the reward function and the discount factor, respectively. The advantage function represents whether the action in which the input value $a_t$ is set under the melt state $s_t$ is preferable. When the action is preferable, the advantage function takes a positive value and the policy is updated to increase the probability ratio $r(s_t, a_t, \theta_p)$ by maximizing the surrogate objective. Conversely, when the action is not preferable, the advantage function takes a negative value and the policy is updated to decrease the probability ratio.

Given the policy and the dynamics, state sequences are generated as a probability distribution, and a state-value function can be calculated:

$$V^{\pi}(s_t) = \mathbb{E}\left[\sum_{k=t}^{T} \gamma^{\,k-t} R_k(s_k, a_k)\right], \tag{7}$$

where $T$ is the length of the trajectories and the expected value is calculated over the probability distribution of the state sequences. In PPO, the state-value function is predicted from the training data without assigning a policy. Thus, the predicted state-value function $V_{\theta_v}(s_t)$, parameterized with $\theta_v$, is optimized using the square-error loss $L^{\mathrm{VF}}(\theta_v)$:

$$L^{\mathrm{VF}}(\theta_v) = \hat{\mathbb{E}}_{s_t}\left[\left(V_{\theta_v}(s_t) - V^{\pi}(s_t)\right)^2\right]. \tag{8}$$

Once the state-value function is predicted, the action-value function $\hat{Q}(s_t, a_t)$ and the advantage function $\hat{A}(s_t, a_t)$ are also predicted by Eqs. (6) and (5), respectively. In addition to the clipped surrogate objective and the state-value function error, an entropy bonus is added to ensure sufficient exploration, and the following objective is maximized for each iteration in PPO [38]:

$$L(\theta_p, \theta_v) = L^{\mathrm{CLIP}}(\theta_p) - c_1 L^{\mathrm{VF}}(\theta_v) + c_2\, \hat{\mathbb{E}}_{s_t}\left[S[\pi_{\theta_p}](s_t)\right], \tag{9}$$

where $c_1$ and $c_2$ are weights. Maximizing $L^{\mathrm{CLIP}}(\theta_p)$ means acquiring the optimized policy $\pi_{\theta_p}(a_t \mid s_t)$ as described in Eqs. (3) and (4). Minimizing $L^{\mathrm{VF}}(\theta_v)$ means that the state-value function is predicted without assuming a policy, as described in Eq. (8). $S[\pi_{\theta_p}](s_t)$ is the entropy of the policy, which serves as a regularization term for training. In PPO, $\theta_p$ and $\theta_v$ are simultaneously optimized in each iteration. Although $L^{\mathrm{CLIP}}$ depends on $\theta_v$ via $A(s_t, a_t)$ and $L^{\mathrm{VF}}$ depends on $\theta_p$ via $V^{\pi}(s_t)$, in the iterative optimization process $\theta_v$ in $L^{\mathrm{CLIP}}$ and $\theta_p$ in $L^{\mathrm{VF}}$ are regarded as constants and not optimized, and the values of the previous step are applied.
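As a rough illustration of how the clipped surrogate, the value-function loss, and the entropy bonus combine into one objective, the following NumPy sketch evaluates them on a batch of samples. It is not the authors' implementation: the function names and the default values of `gamma`, `eps`, `c1`, and `c2` are assumptions made for this example, and the advantage uses the one-step approximation of the state-action value described above.

```python
import numpy as np


def one_step_advantage(rewards, values, next_values, gamma=0.99):
    """Advantage A = Q - V, with Q approximated as R + gamma * V(s_{t+1}).

    gamma=0.99 is an assumed default, not a value taken from the paper.
    """
    q = np.asarray(rewards) + gamma * np.asarray(next_values)
    return q - np.asarray(values)


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Sample mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.mean(np.minimum(unclipped, clipped)))


def ppo_objective(ratio, advantage, v_pred, v_target, entropy,
                  c1=0.5, c2=0.01, eps=0.2):
    """Objective to maximize: L_CLIP - c1 * L_VF + c2 * mean entropy.

    c1, c2 and eps are illustrative weights (assumptions).
    """
    l_clip = clipped_surrogate(ratio, advantage, eps)
    l_vf = float(np.mean((np.asarray(v_pred) - np.asarray(v_target)) ** 2))
    return l_clip - c1 * l_vf + c2 * float(np.mean(entropy))
```

For a positive advantage, a ratio above 1 + eps is clipped, so the objective stops rewarding further increases of the ratio. For a negative advantage, the minimum keeps the more pessimistic of the two terms.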
To optimize the policy, it is necessary to specify the dynamics used to calculate the state-value function by Eq. (7). In our algorithm, GMM dynamics were used for calculation of the state-value function. Thus, the algorithm is completely data-driven without any simulations, which is different from other methods such as the "sim-to-real" approach [39,40]. However, the GMM dynamics can reliably predict the actual dynamics only in the vicinity of the training trajectories. Therefore, we proposed a method to optimize the policy near the training trajectories, where GMM dynamics reliably predict the actual dynamics, and to obtain a policy that can transfer to actual FZ crystal growth. To search the policy space near the training trajectories, we first performed pretraining to bring the policy closer to the training trajectories. Second, we added the error from the averaged action sequences to the reward function, in addition to the error from the ideal trajectory in diameter, $d_t^{\mathrm{ideal}}$. In the reward function used in our proposed algorithm (Eq. (11)), the first term is the error from the ideal trajectory and the second term is the error from $a_t^{*}$, the averaged action sequences of the training trajectories; a weight balances the two terms.
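The idea of a two-term reward evaluated under a learned dynamics model can be sketched as follows. This is only an illustration under stated assumptions: the squared-error form of both terms, the `weight` and `gamma` values, and the callables `dynamics` and `policy` (hypothetical stand-ins for the GMM dynamics and the trained policy) are not taken from the paper.

```python
import numpy as np


def reward(d, d_ideal, a, a_mean, weight=0.1):
    """Negative error from the ideal diameter plus a weighted error from the
    averaged demonstration action (squared errors are an assumption)."""
    diameter_error = (d - d_ideal) ** 2
    diff = np.asarray(a, dtype=float) - np.asarray(a_mean, dtype=float)
    action_error = float(np.sum(diff ** 2))
    return -(diameter_error + weight * action_error)


def discounted_return(dynamics, policy, s0, d_ideal, a_mean,
                      gamma=0.99, weight=0.1):
    """Roll a policy forward through a dynamics model and sum the
    discounted rewards. The state is s = (h, d), so s[1] is the diameter."""
    s = np.asarray(s0, dtype=float)
    total = 0.0
    for t in range(len(d_ideal)):
        a = policy(s)
        total += gamma ** t * reward(s[1], d_ideal[t], a, a_mean[t], weight)
        s = dynamics(s, a)  # stand-in for the learned dynamics model
    return total
```

The action term penalizes trajectories that drift away from the demonstrations, which is where a model learned from few trajectories is least reliable.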
Policy optimization.
Prior to the reinforcement learning, we constructed a data-driven prediction model for FZ crystal growth by GMM, as we previously reported [34]. The number of Gaussian mixtures, which is a hyperparameter of GMM, was set to 50. Since the prediction of the dynamics by GMM is reliable only near the training trajectories, the accuracy of the prediction is significantly poorer when the trajectories deviate greatly from the ideal trajectory, as discussed in detail in the "Results and discussion" section with reference to Fig. 4. If optimization starts from a random default policy, the state sequences generated by GMM will be far from the actual state sequences and will fail to reach the ideal trajectory shown in Fig. 2a. Thus, we performed pretraining using the training trajectories before optimization of the policy by PPO. In the pretraining, the policy was trained to become closer to the averaged action sequences of the training trajectories. In the loss function minimized in the pretraining, $\sigma$ and $\mu_{\theta_p}(s_t^{*})$ represent the variance parameter and the predicted averaged values of the input values under the state $s_t^{*}$ in a training trajectory.

$\mu_{\theta_p}(s_t)$ and $V_{\theta_v}(s_t)$ are modeled by neural networks. Each network has two hidden layers with 64 nodes and a hyperbolic tangent (tanh) activation function. A sigmoid function is used as the activation function of the output layer of the policy network, and the output layer of the state-value network has no activation function. Both networks share weight values, except for the output layers. Training of the neural networks was performed by the Adam method with a learning rate of $1 \times 10^{-5}$ and a batch size of 128 [41]. The probabilistic policy was generated from $\mu_{\theta_p}(s_t)$ and the variance parameters.
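The network layout described above (two shared tanh hidden layers of 64 nodes, a sigmoid policy head, and a linear value head) can be sketched in NumPy as below. The pretraining loss shown is a Gaussian negative log-likelihood with covariance `sigma * I`, which is an assumed form chosen for illustration and not necessarily the authors' loss. The weight initialization, the function names, and the implied scaling of actions to the range (0, 1) are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_params(state_dim=2, action_dim=2, hidden=64):
    """Small random weights and zero biases (initialization is an assumption)."""
    def layer(n_in, n_out):
        return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)
    return {
        "h1": layer(state_dim, hidden),      # shared hidden layer 1
        "h2": layer(hidden, hidden),         # shared hidden layer 2
        "policy": layer(hidden, action_dim), # policy output layer
        "value": layer(hidden, 1),           # state-value output layer
    }


def forward(params, s):
    """Return the policy mean mu(s) and the state value V(s) for a batch of states."""
    h = np.tanh(s @ params["h1"][0] + params["h1"][1])
    h = np.tanh(h @ params["h2"][0] + params["h2"][1])
    logits = h @ params["policy"][0] + params["policy"][1]
    mu = 1.0 / (1.0 + np.exp(-logits))                     # sigmoid output
    v = (h @ params["value"][0] + params["value"][1])[:, 0]  # no activation
    return mu, v


def pretraining_loss(mu, a_star, sigma):
    """Gaussian negative log-likelihood of demonstration actions a_star under
    a policy with mean mu and covariance sigma * I (constant term dropped).
    Treating sigma as a variance is an assumption of this sketch."""
    k = mu.shape[1]
    sq = np.sum((a_star - mu) ** 2, axis=1)
    return float(np.mean(sq / (2.0 * sigma) + 0.5 * k * np.log(sigma)))
```

Minimizing such a loss over the demonstration states pulls the policy mean toward the averaged demonstration actions before PPO starts.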
The detailed algorithm for pretraining the policy and state-value function is shown in Algorithm 1. After the pretraining of the policy, the policy was optimized by PPO while maximizing the objective shown in Eq. (9). Hyper-parameters used for the pretraining and training by PPO are summarized in Table 1. Our PPO program for the FZ crystal growth trajectory is available on GitHub [42].

Figure 3 shows the results of automated control by the policy trained with our proposed algorithm. Note that the training of the policy was performed with the dynamics predicted by GMM from only the training trajectories. The obtained trajectory follows the ideal trajectory well in terms of diameter. Table 2 summarizes the mean square error (MSE) from the ideal trajectory in diameter $d$ for control by PPO and by humans (training trajectories). The deviation from the ideal trajectory for control by PPO is smaller than that for human control.
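The comparison metric mentioned above, the mean square error of the diameter from the ideal trajectory, can be computed as in the short sketch below. The function name and the assumption that both trajectories are sampled at the same time steps are illustrative.

```python
import numpy as np


def mse_from_ideal(d, d_ideal):
    """Mean square error between a diameter trajectory and the ideal one.

    Both sequences are assumed to be sampled at the same time steps.
    """
    d = np.asarray(d, dtype=float)
    d_ideal = np.asarray(d_ideal, dtype=float)
    if d.shape != d_ideal.shape:
        raise ValueError("trajectories must have the same length")
    return float(np.mean((d - d_ideal) ** 2))
```

Applying the same function to a trajectory generated by the trained policy and to each demonstration trajectory gives directly comparable numbers.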

Results and discussion
$$\pi_{\theta_p}(a_t \mid s_t) = \mathrm{Gauss}\left(a_t \mid \mu_{\theta_p}(s_t),\ \sigma I\right)$$

Table 1. Hyper-parameters used for the pretraining and training by PPO.

We successfully constructed a control algorithm for FZ crystal growth with a defined ideal shape from several training trajectories. Pretraining of the policy before PPO is crucially important. Without pretraining, learning of the policy did not progress. Figure 4 shows the evolution of the averaged absolute error from the ideal trajectory in diameter $d$ during training, both when started after pretraining and when started from randomly set initial values. With pretraining, the policy was well trained, and the error decreased with increasing iteration and then saturated. In contrast, without pretraining the error from the ideal trajectory did not decrease with increasing iteration. Furthermore, the error of the GMM dynamics from the true dynamics along the generated trajectory was consistently higher without pretraining than after pretraining. These results indicate that, after pretraining, the policy space was searched in the region where the GMM dynamics are highly accurate.

Design of the reward function, adding the error from the averaged action sequences to the error from the ideal trajectory, is also important for policy optimization. Without the second term in Eq. (11), the deviation from the ideal trajectory is larger than with our proposed reward shown in Eq. (11), especially around t = 400 and t > 600 (Fig. 5a). In these periods, the error of the GMM dynamics for the trajectory generated by the reward without the second term in Eq. (11) is higher than that for the trajectory generated by our reward function (Fig. 5b). These results indicate that adding the second term in Eq. (11) keeps the policy optimization within the region where the GMM dynamics are highly accurate.
The current demonstration shows that automated control of FZ crystal growth is possible by our proposed method from a small number of demonstration trajectories. Since our method determines the policy based on the dynamics predicted by GMM, it is necessary to keep the generated trajectory close to the demonstration trajectories during policy optimization. Pretraining of the policy and proper design of the reward function successfully achieve optimization of the policy by the GMM dynamics within reliable prediction margins. Our proposed method can be applied to other materials processes that require adaptive control according to the process status. Although the present demonstration was based on data obtained by an emulator program, our proposed methodology is expected to work with actual FZ crystal growth.

Conclusion
We have constructed a control model for FZ crystal growth by reinforcement learning using PPO with dynamics predicted by GMM. Our proposed method is completely data-driven and can construct the control model from only a small number of demonstration trajectories. We verified our method by a virtual experiment using the emulator program of FZ crystal growth. As a result, the control model was shown to follow an ideal trajectory in melt diameter more accurately than the demonstration trajectories created by human operation. Since our method determines the policy based on the dynamics predicted by GMM, it is necessary to keep the generated trajectory close to the demonstration trajectories during policy optimization. Pretraining of the policy near the training trajectories and proper design of the reward function successfully achieved optimization of the policy by GMM dynamics within reliable prediction margins. Our proposed method will lead to the automation of materials processing in which adaptive operation is required and help realize high productivity in materials manufacturing. It is expected that the actual FZ crystal growth process can be automated from a small number of demonstration trajectories operated by humans.

Data availability
The data that support the findings of this study are available from the corresponding author, SH, upon reasonable request.